5 summarise clusters #6
Conversation
- Add cluster naming/summarisation step
- Add basic semantic chunking script for comparison with other datasets
- There is now one set of scripts for a bottom-up analysis
- A streamlit app to visualise the output of the bottom-up analysis
- One script to execute a top-down analysis
Thank you Rosie, I've looked through most of the code and test-ran it - it's all good!
I've made a few small comments (in addition to the ones I sent over Slack) - take a look and see if you'd like to address them now.
Otherwise, happy for you to merge and continue with the evals and dashboard.
I also added a few tests that helped me better understand the data processing functions. Didn't have time to do "test-driven reviewing" for the rest of the code, however.
```python
# Fix improperly coded characters
data_df['text_clean'] = data_df['text'].apply(lambda x: ftfy.fix_text(x))
```

Suggested change:

```python
data_df["text_clean"] = data_df["text"].apply(lambda x: ftfy.fix_text(x))
```
Nice, I didn't know about `ftfy.fix_text`
```python
def name_topics(
    topic_info: pd.DataFrame,
    llm_chain,
    topics: List[str],
```
I guess this `topics` variable could be removed; you could move the line `topics = topic_info["Cluster"].unique().tolist()` inside this function, to simplify the input variables.
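The suggested refactor might look like this - a sketch only, assuming a LangChain-style `llm_chain` with an `invoke` method and the `Cluster`/`Top Words` columns from the excerpts above; the function body is hypothetical, not the PR's actual implementation:

```python
from typing import Dict

import pandas as pd


def name_topics(topic_info: pd.DataFrame, llm_chain) -> Dict:
    """Generate a name for each topic. The topic list is derived
    inside the function rather than passed in as an argument."""
    # Previously passed in as a separate `topics: List[str]` argument
    topics = topic_info["Cluster"].unique().tolist()
    names = {}
    for topic in topics:
        top_words = topic_info.loc[topic_info["Cluster"] == topic, "Top Words"].iloc[0]
        names[topic] = llm_chain.invoke({"top_words": top_words})
    return names
```

Callers then only need the DataFrame and the chain, and the two arguments can't get out of sync.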
```python
if __name__ == "__main__":

    errors = []
```
Are you doing anything with this list in this script?
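If the list is kept, a common pattern is to collect per-item failures and report them at the end instead of aborting on the first error. A hypothetical sketch (the `items`/`process` names stand in for whatever the script iterates over, not its actual logic):

```python
errors = []
results = []

# Stand-ins for the script's real work items and processing function
items = ["a", "b", "c"]

def process(item):
    if item == "b":
        raise ValueError(f"could not process {item!r}")
    return item.upper()

for item in items:
    try:
        results.append(process(item))
    except ValueError as e:
        # Record the failure and carry on with the remaining items
        errors.append(str(e))

if errors:
    print(f"{len(errors)} item(s) failed: {errors}")
```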
```python
def match_questions(bot_qs_list: List[str], questions: List[str], threshold: float = 0.85) -> pd.DataFrame:
    """Match the input questions from our interview template,
    and the actual questions produced by the bot, using cosine similarity.
```
Suggested change:

```diff
-    and the actual questions produced by the bot, using cosine similarity.
+    and the actual sentences produced by the bot, using cosine similarity.
```
I guess you're not just matching questions but in fact all sentences?
Hi, yes this is correct! Good spot
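For readers unfamiliar with the approach, here is a self-contained sketch of matching bot sentences to template questions by cosine similarity. It uses a toy word-count embedding so it runs without a model; the real pipeline presumably uses proper sentence embeddings, and all helper names here are illustrative:

```python
from typing import Callable, List

import numpy as np
import pandas as pd


def match_sentences(bot_sentences: List[str], questions: List[str],
                    embed: Callable[[str], np.ndarray],
                    threshold: float = 0.85) -> pd.DataFrame:
    """For each bot sentence, keep the best-matching template question
    if its cosine similarity clears the threshold."""
    rows = []
    q_vecs = [embed(q) for q in questions]
    for sent in bot_sentences:
        v = embed(sent)
        # Cosine similarity against every template question
        sims = [float(np.dot(v, q) / (np.linalg.norm(v) * np.linalg.norm(q)))
                for q in q_vecs]
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            rows.append({"sentence": sent,
                         "matched_question": questions[best],
                         "similarity": sims[best]})
    return pd.DataFrame(rows)


# Toy embedding: counts over a tiny fixed vocabulary (illustrative only;
# zero vectors would need guarding in real use)
VOCAB = ["food", "heating", "cost", "home"]

def toy_embed(text: str) -> np.ndarray:
    words = text.lower().split()
    return np.array([words.count(w) for w in VOCAB], dtype=float)
```

With real sentence embeddings you would swap `toy_embed` for a model encoder and keep the matching logic unchanged.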
```python
interviews_cleaned_df = interviews_df.groupby("conversation").apply(remove_preamble).reset_index(drop=True)
```
It doesn't look like you're using or saving `interviews_cleaned_df` after this...
Also a good spot - thank you!!
```python
topic_counts = pd.DataFrame(data["Cluster"].value_counts()).reset_index()
topic_counts = topic_counts.rename(columns={"count": "N responses in topic"})
```
Minor comment, but I would like us to use pandas chaining as much as possible, as it makes the code easier to read:

```python
topic_counts = (
    pd.DataFrame(data["Cluster"].value_counts())
    .reset_index()
    .rename(columns={"count": "N responses in topic"})
)
```
```python
data_w_names = data_w_names.rename(columns={"llama3.2_name": "Name", "llama3.2_description": "Description"})
data_w_names = pd.merge(data_w_names, topic_counts, left_on="Cluster", right_on="Cluster", how="left")

data_w_names[["Name", "Description", "Top Words", "N responses in topic"]].to_csv(OUTPUT_PATH_SUMMARY, index=False)
```
Same point here about chains.
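A chained version of the rename-and-merge above might look like this - a sketch on toy data, with illustrative column values (since both frames share the `Cluster` key, the `DataFrame.merge` method with `on="Cluster"` replaces the `pd.merge(..., left_on=..., right_on=...)` call):

```python
import pandas as pd

data_w_names = pd.DataFrame({
    "Cluster": [0, 1],
    "llama3.2_name": ["Costs", "Heating"],
    "llama3.2_description": ["About costs", "About heating"],
})
topic_counts = pd.DataFrame({"Cluster": [0, 1], "N responses in topic": [5, 3]})

# Rename the LLM-generated columns, then attach the per-topic counts
data_w_names = (
    data_w_names
    .rename(columns={"llama3.2_name": "Name", "llama3.2_description": "Description"})
    .merge(topic_counts, on="Cluster", how="left")
)
```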
```python
data_viz = pd.merge(data, data_w_names[["Cluster", "Name", "Description"]], on="Cluster", how="left")

data_viz["Name"] = data_viz["Name"].fillna("None")
data_viz["Description"] = data_viz["Description"].fillna("None")
```
You can also use chains here, for example with `.assign(Name=lambda df: df["Name"].fillna("None"))`.
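Concretely, the two `fillna` lines could become one chained expression - a sketch on toy data:

```python
import pandas as pd

data_viz = pd.DataFrame({
    "Name": ["Topic A", None],
    "Description": [None, "Responses about heating"],
})

# Fill missing names/descriptions in a single chained step
data_viz = data_viz.assign(
    Name=lambda df: df["Name"].fillna("None"),
    Description=lambda df: df["Description"].fillna("None"),
)
```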
…ew_transcripts into 5-summarise-clusters
Co-authored-by: Karlis Kanders <[email protected]>
Fixes issues #5 (topic summarisation) and #7 (semantic chunking)
To review, run

```shell
python dsp_interview_transcripts/pipeline/run_pipeline.py
```

or follow the instructions in this readme. Check that everything runs and the outputs are as expected. There is also a notebook on semantic chunking, but only look at that if you are interested - we're not taking that route for now.